The Anjials of Statistics 

2011, Vol. 39, No. 2, 1180-1210 

DOI: 10.1214/10-AOS864 

@ Institute of Mathematical Statistics, 2011 



PERFORMANCE GUARANTEES FOR INDIVIDUALIZED 
TREATMENT RULES^ 

By Min Qian and Susan A. Murphy 

University of Michigan 

Because many illnesses show heterogeneous response to treat- 
ment, there is increasing interest in individualizing treatment to pa- 
tients [Arch. Gen. Psychiatry 66 (2009) 128-133]. An individualized 
treatment rule is a decision rule that recommends treatment accord- 
ing to patient characteristics. We consider the use of clinical trial 
data in the construction of an individualized treatment rule leading 
to highest mean response. This is a difficult computational problem 
because the objective function is the expectation of a weighted indica- 
tor function that is nonconcave in the parameters. Furthermore, there 
are frequently many pretreatment variables that may or may not be 
useful in constructing an optimal individualized treatment rule, yet 
cost and interpretability considerations imply that only a few vari- 
ables should be used by the individualized treatment rule. To address 
these challenges, we consider estimation based on Zi-penalized least 
squares. This approach is justified via a finite sample upper bound on 
the difference between the mean response due to the estimated indi- 
vidualized treatment rule and the mean response due to the optimal 
individualized treatment rule. 

1. Introduction. Many illnesses show heterogeneous response to treat- 
ment. For example, a study on schizophrenia [12] found that patients who 
take the same antipsychotic (olanzapine) may have very different responses. 
Some may have to discontinue the treatment due to serious adverse events 
and/or acutely worsened symptoms, while others may experience few if any 
adverse events and have improved clinical outcomes. Results of this type 
have motivated researchers to advocate the individualization of treatment 
to each patient [11, 16, 23]. One step in this direction is to estimate each pa- 
tient's risk level and then match treatment to risk category [5, 6]. However, 
this approach is best used to decide whether to treat; otherwise it assumes 
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the knowledge of the best treatment for each risk category. Alternately, there 
is an abundance of literature focusing on predicting each patient's prognosis 
under a particular treatment [10, 28]. Thus, an obvious way to individual- 
ize treatment is to recommend the treatment achieving the best predicted 
prognosis for that patient. In general, the goal is to use data to construct 
individualized treatment rules that, if implemented in future, will optimize 
the mean response. 

Consider data from a single stage randomized trial involving several active 
treatments. A first natural procedure to construct the optimal individual- 
ized treatment rule is to maximize an empirical version of the mean response 
over a class of treatment rules (assuming larger responses are preferred). As 
will be seen, this maximization is computationally difficult because the mean 
response of a treatment rule is the expectation of a weighted indicator that 
is noncontinuous and nonconcave in the parameters. To address this chal- 
lenge, we make a substitution. That is, instead of directly maximizing the 
empirical mean response to estimate the treatment rule, we use a two-step 
procedure that first estimates a conditional mean and then from this esti- 
mated conditional mean derives the estimated treatment rule. As will be 
seen in Section 3, even if the optimal treatment rule is contained in the 
space of treatment rules considered by the substitute two-step procedure, 
the estimator derived from the two-step procedure may not be consistent. 
However, if the conditional mean is modeled correctly, then the two-step 
procedure consistently estimates the optimal individualized treatment rule. 
This motivates consideration of rich conditional mean models with many 
unknown parameters. Furthermore, there are frequently many pretreatment 
variables that may or may not be useful in constructing an optimal indi- 
vidualized treatment rule, yet cost and interpretability considerations imply 
that fewer rather than more variables should be used by the treatment rule. 
This consideration motivates the use of Zi-penalized least squares (Zi-PLS). 

We propose to estimate an optimal individualized treatment rule using 
a two step procedure that first estimates the conditional mean response 
using /i-PLS with a rich linear model and second, derives the estimated 
treatment rule from estimated conditional mean. For brevity, throughout, 
we call the two step procedure the /i-PLS method. We derive several finite 
sample upper bounds on the difference between the mean response to the 
optimal treatment rule and the mean response to the estimated treatment 
rule. All of the upper bounds hold even if our linear model for the condi- 
tional mean response is incorrect and to our knowledge are, up to constants, 
the best available. We use the upper bounds in Section 3 to illuminate the 
potential mismatch between using least squares in the two-step procedure 
and the goal of maximizing the mean response. The upper bounds in Sec- 
tion 4.1 involve a minimized sum of the approximation error and estimation 
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error; both errors result from the estimation of the conditional mean re- 
sponse. We shall see that /i-PLS estimates a linear model that minimizes 
this approximation plus estimation error sum among a set of suitably sparse 
linear models. 

If the part of the model for the conditional mean involving the treatment 
effect is correct, then the upper bounds imply that, although a surrogate 
two-step procedure is used, the estimated treatment rule is consistent. The 
upper bounds provide a convergence rate as well. Furthermore, in this set- 
ting, the upper bounds can be used to inform how to choose the tuning 
parameter involved in the li penalty to achieve the best rate of convergence. 
As a by-product, this paper also contributes to existing literature on Zi-PLS 
by providing a finite sample prediction error bound for the Zi-PLS estimator 
in the random design setting without assuming the model class contains or 
is close to the true model. 

The paper is organized as follows. In Section 2, we formulate the decision 
making problem. In Section 3, for any given decision, that is, individualized 
treatment rule, we relate the reduction in mean response to the excess pre- 
diction error. In Section 4, we estimate an optimal individualized treatment 
rule via li-PLS and provide a finite sample upper bound on the reduction 
in mean response achieved by the estimated rule. In Section 5, we con- 
sider a data dependent tuning parameter selection criterion. This method 
is evaluated using simulation studies and illustrated with data from the 
Nefazodone-CBASP trial [13]. Discussions and future work are presented in 
Section 6. 

2. Individualized treatment rules. We use upper case letters to denote 
random variables and lower case letters to denote values of the random vari- 
ables. Consider data from a randomized trial. On each subject, we have the 
pretreatment variables X € X, treatment A taking values in a finite, discrete 
treatment space A, and a real-valued response R (assuming large values are 
desirable). An individualized treatment rule (ITR) d is a deterministic deci- 
sion rule from X into the treatment space A. 

Denote the distribution of {X, A, R) by P. This is the distribution of the 
clinical trial data; in particular, denote the known randomization distribu- 
tion of A given X by p{-\X). The likelihood of {X,A,R) under P is then 
fo{x)p{a\x)fi{r\x,a), where /o is the unknown density of X and /i is the 
unknown density of R conditional on {X, A) . Denote the expectations with 
respect to the distribution P by an E. For any ITR d: X ^ A, let P'^ de- 
note the distribution of {X,A,R) in which d is used to assign treatments. 
Then the likelihood of {X,A,R) under P'^ is fo{x)la=d{x) fi{r\ x,a). Denote 
expectations with respect to the distribution P'^ by an E'^. The Value of d is 
defined as V{d) = E'^{R). An optimal ITR, do, is a rule that has the maximal 
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Value, that is, 



do G argmaxF((i), 
d 

where the argmax is over all possible decision rules. The Value of do, V{do), 
is the optimal Value. 

Assume P[p{a\X) > 0] = 1 for all a £ A (i.e., all treatments in A are 
possible for all values of X a.s.). Then P'^ is absolutely continuous with 
respect to P and a version of the Radon-Nikodym derivative is dP'^/dP = 
la=d{x)/p{0'\^)- Thus, the Value of d satisfies 



(2.1) V{d) = E'^{R) = / RdP 



dP'^ , 

•'Hp'"'' 



E 



^A=d{X) 



R 



Our goal is to estimate do, that is, the ITR that maximizes (2.1), using data 
from distribution P. When X is low dimensional and the best rule within 
a simple class of ITRs is desired, empirical versions of the Value can be used 
to construct estimators [22, 26]. However, if the best rule within a larger 
class of ITRs is of interest, these approaches are no longer feasible. 

Define Qq{X,A) = E{R\X,A) [QQ{x,a) is sometimes called the "Quality" 
of treatment a at observation x\. It follows from (2.1) that for any ITR d, 

'I, 



V{d) = E 



^A=d{X) 



Qo{X,A) 



E 



^ ^d{x)=aQo{X,a) 



E[QoiX,d{X))]. 



Thus, V{do) = E[QoiX,do{X))] < E[maXaeAQo{X,a)]. On the other hand, 
by the definition of do. 



V{do)>V{d)U^x) 



GargmaXjjg_4Qo{^ia) 



maxQolX, a) 
- aeA 



Hence, an optimal ITR satisfies do(X) G argmax^g^^ QQ[X,a) a.s. 



3. Relating the reduction in Value to excess prediction error. The above 
argument indicates that the estimated ITR will be of high quality (i.e., have 
high Value) if we can estimate Qo accurately. In this section, we justify 
this by providing a quantitative relationship between the Value and the 
prediction error. 

Because ^ is a finite, discrete treatment space, given any ITR, d, there 
exists a square integrable function Q :X x ^ — M for which d{X) G 
arg maxa Q{X,a) a.s. Let L{Q) = E[R - Q{X, A)]'^ denote the prediction 
error of Q (also called the mean quadratic loss). Suppose that Qo is square 
integrable and that the randomization probability satisfies p{a\x) > for 
an 5 > and all (x, a) pairs. Murphy [21] showed that 



(3.1) 



Vido) - V{d) < 25V2[L(Q) - L{Qop\ 
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Intuitively, this upper bound means that if the excess prediction error of Q 
[i.e., L[Q) — L[Qq)\ is small, then the reduction in Value of the associated 
ITR d [i.e., V{do) — V{d)\ is small. Furthermore, the upper bound provides 
a rate of convergence for the Value of an estimated ITR. For example, sup- 
pose Qo is linear, that is, Qo = ^{X^A)Oq for a given vector-valued basis 
function ^ on X x A and an unknown parameter 6q. And suppose we use 
a correct linear model for Qq (here "linear" means linear in parameters), 
say the model Q = {^{X, A)6 : € M"^™{*)} or a linear model containing Q 
with dimension of parameters fixed in n. If we estimate 6 by least squares 
and denote the estimator by 0, then the prediction error of Q = ^0 con- 
verges to L{Qq) at rate 1/n under mild regularity conditions. This together 
with inequality (3.1) implies that the Value obtained by the estimated ITR, 
d(X) € argmaxa Q{X, a), will converge to the optimal Value at rate at least 

In the following theorem, we improve this upper bound in two aspects. 
First, we show that an upper bound with exponent larger than 1/2 can be 
obtained under a margin condition, which implicitly implies a faster rate of 
convergence. Second, it turns out that the upper bound need only depend 
on one term in the function Q; we call this the treatment effect term T. For 
any square integrable Q, the associated treatment effect term is defined as 
T{X,A) = Q{X,A) - E[Q{X,A)\X]. Note that d{X) £ arg max^ r(X, a) = 
argmaXaQ{X,a) a.s. Similarly, the true treatment effect term is given by 

(3.2) To{X,A) ^Qo{X,A) - E[Qo{X,A)\X]. 

TQ{x,a) is the centered effect of treatment A = a at observation X = x; 
do{X) G arg maX(j To (X, a) . 

Theorem 3.1. Suppose p{a\x) > for a positive constant S for all 
(x,o) pairs. Assume there exists some constants C > and a > such that 

(3.3) p(maxTo(X,a) - max To{X,a) < e) < Ce'^ 

for all positive e. Then for any ITR d:X^A and square integrable function 
Q:X X T-M such that d{X) E aigm.ayia^AQ{X,a) a.s., we have 

(3.4) V{do) - V{d) < C'[L{Q) - L(Qo)]^'+")/^'+") 
and 

(3.5) V{do)-V{d)<C'[E{T{X,A)-To{X,A)ff+'''^/^^+''\ 
where C = (22+3«5i+ac7)i/(2+a) _ 

The proof of Theorem 3.1 is in Appendix A.l. 
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Remarks. 

(1) We set the second maximum in (3.3) to — oo if for an x, To(x,a) is 
constant in a and thus the set ^\ argmaXag_4To(x,a) = 0. 

(2) Condition (3.3) is similar to the margin condition in classification 
[18, 24, 32]; in classification this assumption is often used to obtain sharp 
upper bounds on the excess 0-1 risk in terms of other surrogate risks [2]. 
Here mayia&ATo{x,a) - maXag_4\argmax,6^To(x,a) ^0(2;, «) can be viewed as 
the "margin" of Tq at observation X = x. It measures the difference in mean 
responses between the optimal treatment (s) and the best suboptimal treat- 
ment(s) at x. For example, suppose X ~ C/[— 1,1], P{A = 1\X) = P{A = 
— 1\X) = 1/2 and Tq{X,A) = XA. Then the margin condition holds with 
C = 1/2 and a = 1. Note the margin condition does not exclude multiple 
optimal treatments for any observation x. However, when q > 0, it does 
exclude suboptimal treatments that yield a conditional mean response very 
close to the largest conditional mean response for a set of x with nonzero 
probability. 

(3) For C = 1,Q = 0, condition (3.3) always holds for all e > 0; in this 
case (3.4) reduces to (3.1). 

(4) The larger the a, the larger the exponent (1 + a)/(2 + a) and thus the 
stronger the upper bounds in (3.4) and (3.5). However, the margin condition 
is unlikely to hold for all e if a is very large. An alternate margin condition 
and upper bound are as follows. 

Suppose p{a\x) > for all {x, a) pairs. Assume there is an e>0, such that 



Then V{do) - V{d) < 4S[L{Q) - L{Qo)]/e and V{do) - V{d) < ASE{T - 



The proof is essentially the same as that of Theorem 3.1 and is omitted. 
Condition (3.6) means that Tq evaluated at the optimal treatment(s) mi- 
nus To evaluated at the best suboptimal treatment(s) is bounded below by 
a positive constant for almost all X observations. If X assumes only a finite 
number of values, then this condition always holds, because we can take e to 
be the smallest difference in Tq when evaluated at the optimal treatment (s) 
and the suboptimal treatment(s) [note that if To(x,a) is constant for all 
a£ A for some observation X = x, then all treatments are optimal for that 
observation] . 

(5) Inequality (3.5) cannot be improved in the sense that choosing T = 
Tq yields zero on both sides of the inequality. Moreover, an inequality in 
the opposite direction is not possible, since each ITR is associated with 
many nontrivial T-functions. For example, suppose X~C/[— 1,1], P(A = 



l\X) = P{A = -1\X) = 1/2 and Tq{X,A) = (X - 1/3)^ A. The optimal ITR 



(3.6) 
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is dQ{X) = 1 a.s. Consider T{X,A) = OA. Then maximizing T{X,A) yields 
the optimal ITR as long as 6 > 0. This means that the left-hand side (LHS) 
of (3.5) is zero, while the right-hand side (RHS) is always positive no matter 
what value 6 takes. 

Theorem 3.1 supports the approach of minimizing the estimated predic- 
tion error to estimate Qq or Tq and then maximizing this estimator over 
o G ^ to obtain an ITR. It is natural to expect that even when the approx- 
imation space used in estimating Qq or Tq does not contain the truth, this 
approach will provide the best (highest Value) of the considered ITRs. Unfor- 
tunately, this does not occur due to the mismatch between the loss functions 
(weighted 0-1 loss and the quadratic loss). This mismatch is indicated by 
remark (5) above. More precisely, note that the approximation space, say Q 
for Qq, places implicit restrictions on the class of ITRs that will be consid- 
ered. In effect, the class of ITRs is Vq = {d{X) G argmax^ Q{X, a):Q£ Q}. 
It turns out that minimizing the prediction error may not result in the ITR 
in Dq that maximizes the Value. This occurs when the approximation space 
Q does not provide a treatment effect term close to the treatment effect term 
in Qq. In the following toy example, the optimal ITR do belongs to Vq, yet 
the prediction error minimizer over Q does not yield do- 

A TOY EXAMPLE. Suppose X is uniformly distributed in [—1,1], A is 
binary {—1,1} with probability 1/2 each and is independent of X, and R 
is normally distributed with mean Qq{X,A) = {X — 1/3)'^ A and variance 1. 
It is easy to see that the optimal ITR satisfies dQ{X) = 1 a.s. and V{dQ) = 
4/9. Consider approximation space Q = {Q{X,A;9) = {1,X,A,XA)6:0 e 
M^} for Qq. Thus the space of ITRs under consideration is Vq = {d{X) = 
sign(03 -|- O^X) '.63, 04 G M}. Note that do S since dQ{X) can be writ- 
ten as sign(03 -|- O^X) for any 63 > and 64 = 0. do is the best treat- 
ment rule in Vq. However, minimizing the prediction error L{Q) over Q 
yields Q*iX,A) = (4/9 - 2/3X)A. The ITR associated with Q* is d*{X) = 
argmaXag|_i 1} Q*(X, a) = sign(2/3 — X), which has lower Value than do 
{V{d*) = ^[ i^(2/3-^)>o^ ] = 29/81 < V{do)). 

4. Estimation via Zi-penalized least squares. To deal with the mismatch 
between minimizing the prediction error and maximizing the Value discussed 
in the prior section, we consider a large linear approximation space Q for Qq. 
Since overfitting is likely (due to the potentially large number of pretreat- 
ment variables and/or large approximation space for Qq), we use penalized 
least squares (see Section S.l of the supplemental article [25] for further dis- 
cussion of the overfitting problem). Furthermore, we use /i-penalized least 
squares (/i-PLS, [31]) as the /i penalty does some variable selection and as 
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a result will lead to ITRs that are cheaper to implement (fewer variables to 
collect per patient) and easier to interpret. See Section 6 for the discussion 
of other potential penalization methods. 

Let {{Xi, Ai, Ri)}^^^ represent i.i.d. observations on n subjects in a ran- 
domized trial. For convenience, we use En to denote the associated empirical 
expectation [i.e., E^f = f{Xi,Ai,Ri)/n for any real-valued function / 
on X X AxM.]. Let Q = {Q{X,A;e) = <P{X,A)e,ee M^} be the approxi- 
mation space for Qq, where ^{X, A) = {(j)i{X, A), . . . ,(f>j{X,A)) is a 1 by J 
vector composed of basis functions on X x A, 9 is a J hy 1 parameter vector, 
and J is the number of basis functions (for clarity here J will be fixed in n, 
see Appendix A. 2 for results with J increasing as n increases). The Zi-PLS 
estimator of 6 is 

(4.1) 0„ = arg min ( E„ [i? - $ (X, A) 0] 2 + A„ V a, I 1 1 , 

where aj = [En(pj{X, A)'^]^^'^ , 0j is the jth component of 6 and A„ is a tuning 
parameter that controls the amount of penalization. The weights (Tj's are 
used to balance the scale of different basis functions; these weights were 
used in Bunea, Tsybakov and Wegkamp [4] and van de Geer [33]. In some 
situations, it is natural to penalize only a subset of coefficients and/or use 
different weights in the penalty; see Section S.2 of the supplemental artic- 
le [25] for required modifications. The resulting estimated ITR satisfies 

(4.2) dn{X) £ arg max ^{X,a)6n. 

4.1. Performance guarantee for the li-PLS. In this section, we provide 
finite sample upper bounds on the difference between the optimal Value and 
the Value obtained by the Zi-PLS estimator in terms of the prediction errors 
resulting from the estimation of Qq and Tq. These upper bounds guarantee 
that if Qo (or Tq) is consistently estimated, the Value of the estimated 
ITR will converge to the optimal Value. Perhaps more importantly, the 
finite sample upper bounds provided below do not require the assumption 
that either Qq or Tq is consistently estimated. Thus, each upper bound 
includes approximation error as well as estimation error. The estimation 
error decreases with decreasing model sparsity and increasing sample size. 
An "oracle" model for Qq (or Tq) minimizes the sum of these two errors 
among suitably sparse linear models [see remark (2) after Theorem 4.3 for 
a precise definition of the oracle model]. In finite samples, the upper bounds 
imply that dn, the ITR produced by the /i-PLS method, will have Value 
roughly as if the /i-PLS method detects the sparsity of the oracle model 
and then estimates from the oracle model using ordinary least squares [see 
remark (3) below]. 
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Define the prediction error minimizer e* G by 
(4.3) 6* GargminL($0)=argminS(i?-$0)^. 



For expositional simpHcity assume that 9* is unique, and define the spar- 
sity of G M'^ by its Iq norm, \\0\\q (see Appendix A. 2 for a more general 
setting, where 6* is not unique and a laxer definition of sparsity is used). 
As discussed above, for finite n, instead of estimating 0* , the /i-PLS es- 
timator On estimates a parameter 0**, possessing small prediction error 
and with controlled sparsity. For any bounded function / on x A, let 
, = sup^g;^' „£_4 |/(x, a)| . 0** lies in the set of parameters ©„ defined 



by 



(4.4) 



max 
i=i,...,j 



< 11?? 



and ||0||o < 



/3 



log (Jn) 



n 



n 



A89U Y log( Jn) 



where aj = {E(f>j)^/'^, and rj, /3 and U are positive constants that will be 
defined in Theorem 4.1. 

The first two conditions in (4.4) restrict 0„ to O s with controlled distance 
in sup norm and with controlled distance in prediction error via first order 
derivatives (note that \E[^j<^ie - e*)]/aj\ = \dL{^G)/dej - dL{^e*)/de*\/ 
2a j). The third condition restricts 0^ to sparse 0's. Note that as n in- 
creases this sparsity requirement becomes laxer, ensuring that 9* G Qn for 
sufficiently large n. 

When Qn is nonempty, 0** is given by 

(4.5) e*n* = argmin[L(^>6>) + 3||6>||oA2 //?]. 



Note that e*n* ' 



■3||0||oA2//3> 



is at least as sparse as 6* since by (4.3), L{^6) 
L{^e*) + 3||0*||oA2//3 for any 6 such that ||0||o > ||0*||o. 

The following theorem provides a finite sample performance guarantee for 
the ITR produced by the /i-PLS method. Intuitively, this result implies that 
if Qq can be well approximated by the sparse linear representation 0** [so 
that both L[^9^) — L{Qq) and ||0Ji*||o are small], then d„ will have Value 
close to the optimal Value in finite samples. 



Theorem 4.1. Suppose p{a\x) > for a positive constant S for all 
{x,a) pairs and the margin condition (3.3) holds for some C > 0, a > and 
all positive e. Assume: 
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(1) the error terms ei = Ri — QQ{Xi,Ai),i = 1, . . . are independent of 
(Xj, = 1, . . . ,n and are i.i.d. withE{ei) = Q and E\\eif\ <l\c^"'^a'^ /2 
for some c, cr^ > for all I > 2; 

(2) there exist finite, positive constants U and r] such that maxj=i^...^j 
llf^jlloo/c^j < U and \\Qo - ^0*\\oo < ??; and 

(3) E[{(j)i/ai, . . . , (pj / cfjY' {(t)i/ CFi, . . . , /aj)] is positive definite, and the 
smallest eigenvalue is denoted by (3. 

Consider the estimated ITR dn defined by (4 -3) with tuning parameter 



where A; = 82max{c, a, 77}. Let Qn be the set defined in (4-4)- Then for any 
n > 24[7^ log( Jn) and for which 6„ is nonempty, we have, with probability 
at least 1 — 1/n, that 



The result follows from inequality (3.4) in Theorem 3.1 and inequality 
(4.10) in Theorem 4.3. Similar results in a more general setting can be 
obtained by combining (3.4) with inequality (A. 7) in Appendix A. 2. 

Remarks. 

(1) Note that 0** is the minimizer of the upper bound on the RHS of 
(4.7) and that 0** is contained in the set {On^"^^ ■m C {1,...,J}}. Each 

6n^ satisfies On = argmin|0g0^.0^=o for all j(^m}L{^6); that is, 
minimizes the prediction error of the model indexed by the set m (i.e., 
model {J^jem^j^j • ^ ^}) (within 0„). For each dn^"^\ the first term in 

the upper bound in (4.7) [i.e., L{(^6n^^^) — L(Qo)] is the approximation 
error of the model indexed by m within 0„ . As in van de Geer [33] , we call 

the second term 3||0n^"'^||o'^n//5 the estimation error of the model indexed 
by m. To see why, first put A.„ = k^J\og{Jn) /n. Then, ignoring the log(?i) 
factor, the second term is a function of the sparsity of model m relative 
to the sample size, n. Up to constants, the second term is a "tight" upper 
bound for the estimation error of the OLS estimator from model m, where 
"tight" means that the convergence rate in the bound is the best known rate. 
Note that 6** is the parameter that minimizes the sum of the two errors 
over all models. Such a model (the model corresponding to 0**) is called an 
oracle model. The log(n) factor in the estimation error can be viewed as the 



(4.6) 




(4.7) V{d^)-V{dn)<C' min(L(^>0)-L(Qo) + 3||0||oA2//3) 
where C = (22+3a5i+ac;)i/(2+a) _ 



{1+q)/{2+q) 
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price paid for not knowing the sparsity of the oracle model and thus having 
to conduct model selection. See remark (2) after Theorem 4.3 for the precise 
definition of the oracle model and its relationship to 0**. 

(2) Suppose Xn = o(l). Then in large samples the estimation error term 
3||0||oA,^//3 is negligible. In this case, 0** is close to 6*. When the model 
^6* approximates Qq sufficiently well, we see that setting equal to its 
lower bound in (4.6) provides the fastest rate of convergence of the upper 
bound to zero. More precisely, suppose Qo = ^0* [i.e., L{^6*) — L{Qq) = 0]. 
Then inequality (4.7) implies that V{do)-V{dn) < Op((logn/n)(i+")/{2+«)). 
A convergence in mean result is presented in Corollary 4.1. 

(3) In finite samples, the estimation error 3||0||oA^//3 is nonnegligible. 
The argument of the minimum in the upper bound (4.7), 0**, minimizes 
prediction error among parameters with controlled sparsity. In remark (2) 
after Theorem 4.3, we discuss how this upper bound can be viewed as a tight 
upper bound for the prediction error of the OLS estimator from an oracle 
model in the step-wise model selection setting. In this sense, inequality (4.7) 
implies that the treatment rule produced by the ?i-PLS method will have 
a reduction in Value roughly as if it knew the sparsity of the oracle model 
and were estimated from the oracle model using OLS. 

(4) Assumptions (l)-(3) in Theorem 4.1 are employed to derive the finite 
sample prediction error bound for the /i-PLS estimator On defined in (4.1). 
Below we briefly discuss these assumptions. 

Assumption (1) implicitly implies that the error terms do not have heavy 
tails. This condition is often assumed to show that the sample mean of 
a variable is concentrated around its true mean with a high probability. It is 
easy to verify that this assumption holds if each £{ is bounded. Moreover, it 
also holds for some commonly used error distributions that have unbounded 
support, such as the normal or double exponential. 

Assumption (2) is also used to show the concentration of the sample mean 
around the true mean. It is possible to replace the boundedness condition 
by a moment condition similar to assumption (1). This assumption requires 
that all basis functions and the difference between Qq and its best linear 
approximation are bounded. Note that we do not assume Q to be a good 
approximation space for Qq. However, if ^0* approximates Qq well, rj will 
be small, which will result in a smaller upper bound in (4.7). In fact, in the 
generalized result (Theorem A.l) we allow U and rj to increase in n. 

Assumption (3) is employed to avoid collinearity. In fact, we only need 



for 6, 6' belonging to a subset of M'' (see Assumption A. 3), where Mq{0) = 
{j = I, . . . , J -.Oj ^ 0}. Condition (4.8) has been used in van de Geer [33]. This 
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condition is also similar to the restricted eigenvalue assumption in Bickel, 
Ritov and Tsybakov [3] in which E is replaced by and a fixed design ma- 
trix is considered. Clearly, assumption (3) is a sufficient condition for (4.8). 
In addition, condition (4.8) is satisfied if the correlation \E(j)j(f)k\/{ajak) is 
small for all k € Mq{6), j ^ k and a subset of 0's (similar results in a fixed 
design setting have been proved in Bickel, Ritov and Tsybakov [3]. The 
condition on correlation is also known as "mutual coherence" condition in 
Bunea, Tsybakov and Wegkamp [4]). See Bickel, Ritov and Tsybakov [3] for 
other sufficient conditions for (4.8). 

The above upper bound for V{dQ) — V{dn) involves L{^0) — L{Qq), which 
measures how well the conditional mean function Qq is approximated by Q. 
As we have seen in Section 3, the quality of the estimated ITR only de- 
pends on the estimator of the treatment effect term Tq. Below we provide 
a strengthened result in the sense that the upper bound depends only on 
how well we approximate the treatment effect term. 

First, we identify terms in the linear model Q that approximate Tq (re- 
cah that To{X, A) = Qo{X, A) - E[Qo{X, A)\X]). Without loss of generality, 
we rewrite the vector of basis functions as ^{X,A) = {^^^\X) , ^^^^X , A)) , 
where = {(f)i{X), . . . ,(/>j(i)(X)) is composed of all components in $ that 
do not contain A and <I>*^^-' = {(j)j(i)_^_i{X,A), . . . ,(pj{X,A)) is composed of all 
components in <1> that contain A. Note that A takes only finite values. When 
the randomization distribution p{a\x) does not depend on x, we can code A 
so that £'[$(2)(X,^)^|X] = a.s. (see Section 5.2 and Appendix A.3, for 
examples). For any = (^i, . . . , 9j)'^ G R-^, denote 6^^^ = (6/1, ... , 9jm)'^ and 
e^"^^ = {ej(i)^^,...,ejf. Then approximates E[Qo{X,A)\X] and 

$(2)0(2) approximates Tq. 

The following theorem implies that if the treatment effect term Tq can be 
well approximated by a sparse representation, then dn will have Value close 
to the optimal Value. 

Theorem 4.2. Suppose p{a\x) > for a positive constant S for all 
(x,a) pairs and the margin condition (3.3) holds for some C > 0, a > 
and all positive e. Assume E[(^^'^\X , A)'^ \X] = a.s. Suppose assumptions 
(l)-(3) in Theorem 4-1 hold. Let d^ be the estimated ITR with A„ sat- 
isfying condition (4-6). Let 0„ be the set defined in (4-4)- Then for any 
n > 24J7^ log( Jn) and for which Qn is nonempty, we have, with probability 
at least 1 — 1/n, that 

V{dQ) - V{dn) 

(4.9) 

< C Uin (i?(c^(2)0(2) - nf + 5||0(^)||oA^//3)l ''^''^"'^''\ 



INDIVIDUALIZED TREATMENT RULES 



13 



where C = (22+3«5i+ac;)i/(2+a) _ 

The result follows from inequality (3.5) in Theorem 3.1 and inequality 
(4.11) in Theorem 4.3. 

Remarks. 

(1) Inequality (4.9) improves inequality (4.7) in the sense that it guar- 
antees a small reduction in Value of dn [i.e., V{do) — V[dn)\ as long as the 
treatment effect term Tq is well approximated by a sparse linear represen- 
tation; it does not require a good approximation of the entire conditional 
mean function Qq. In many situations Qo may be very complex, but Tq 
could be very simple. This means that Tq is much more likely to be well 
approximated as compared to Qo (indeed, if there is no difference between 
treatments, then Tq = 0). 

(2) Inequality (4.9) cannot be improved in the sense that if there is no 
treatment effect (i.e., Tq = 0), then both sides of the inequality are zero. 
This result implies that minimizing the penalized empirical prediction error 
indeed yields high Value (at least asymptotically) if Tq can be well approx- 
imated. 

The following asymptotic result follows from Theorem 4.2. Note that 
when£;[«>(2)(X,A)^|X] =Oa.s., L{^e) - L{Qq) = E\^^^'^e^^'^ - E{QQ\X)f + 
£;[<|)(2)0{2) _ 2o]2^ Thus, the estimation of the treatment effect term Tq 
is asymptotically separated from the estimation of the main effect term 
E{Qq\X). In this case, $(2)0(2)'* is the best linear approximation of the 
treatment effect term Tq, where 0(2),* jg ^]^g vector of components in 6* 
corresponding to <I>(2). 

Corollary 4.1. Suppose p{a\x) > for a positive constant S for 
all {x,a) pairs and the margin condition (3.3) holds for some C > 0, a > 
and all positive e. Assume E[(^^'^\X , A)^ \X] =0 a.s. In addition, suppose 

assumptions (l)-(3) in Theorem 4-1 hold. Let dn be the estimated ITR with 
tuning parameter A„ = ki y^log( Jn)/n for a constant ki > 82 max{c, o", r/} . If 
ro(X,yl) = $(2)6>(2).*, then 

V{dQ) - E[V{dn)] = 0((logn/n)(i+-)/(2+")). 

This result provides a guarantee on the convergence rate of V{dn) to the 
optimal Value. More specifically, it means that if Tq is correctly approxi- 
mated, then the Value of dn will converge to the optimal Value in mean at 
rate at least as fast as (logn/n)("'^^")/(2+'^) with an appropriate choice of A^. 
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4.2. Prediction error bound for the li-PLS estimator. In this section, we 
provide a finite sample upper bound for tlie prediction error of the Zi-PLS 
estimator On- This result is needed to prove Theorem 4.1. Furthermore, this 
result strengthens existing literature on Zi-PLS method in prediction. Fi- 
nite sample prediction error bounds for the /i-PLS estimator in the random 
design setting have been provided in Bunea, Tsybakov and Wegkamp [4] 
for quadratic loss, van de Geer [33] mainly for Lipschitz loss, and Koltchin- 
skii [15] for a variety of loss functions. With regards quadratic loss, Koltchin- 
skii [15] requires the response Y is bounded, while both Bunea, Tsybakov 
and Wegkamp [4] and van de Geer [33] assumed the existence of a sparse 
6 € M"^ such that E{^0 — Qo)^ is upper bounded by a quantity that de- 
creases to at a certain rate as n — t- oo (by permitting J to increase with 
n so ^ depends on n as well). We improve the results in the sense that we 
do not make such assumptions (see Appendix A. 2 for results when <I>, J are 
indexed by n and J increases with n). 

As in the prior sections, the sparsity of 9 is measured by its Iq norm, 
||0||o (see the Appendix A. 2 for proofs with a laxer definition of sparsity). 
Recall that the parameter 0** defined in (4.5) has small prediction error and 
controlled sparsity. 

Theorem 4.3. Suppose assumptions (l)-(3) in Theorem 4-1 hold. For 
any rji > 0, let On be the li-PLS estimator defined by (4-1) with tuning 
parameter Xn satisfying condition (4-6). Let 0„ be the set defined in (4-4)- 
Then for any n > 24C/^ log( J?i) and for which 0„ is nonempty, we have, 
with probability at least 1 — 1/n, that 

(4.10) L{^On) < min (L(cl>0) + 3\\0\\oXl/ (3) = L(c^C) + 3||ClloA^//3. 

Furthermore, suppose E[^^'^\X,A)'^\X] =0 a.s. Then with probability at 
least 1 — 1/n, 

(4.11) £;($(2) 0(2) _ 71^)2 < min (£;($(2)0(2) _ 21^)2 ^5||^(2)||^^2/^)^ 

The results follow from Theorem A.l in Appendix A. 2 with p = 0, 7 = 1/8, 
'yi = ^2 = ^) t = log2n and some simple algebra [notice that assumption (3) 
in Theorem 4.1 is a sufficient condition for Assumptions A. 3 and A. 4]. 

Remarks. Inequality (4.11) provides a finite sample upper bound on 
the mean square difference between Tq and its estimator. This result is used 
to prove Theorem 4.2. The remarks below discuss how inequality (4.10) 
contributes to the /i-penalization literature in prediction. 
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(1) The conclusion of Theorem 4.3 holds for all choices of A„ that sat- 
isfy (4.6). Suppose A„, = o(l). Then L{^e*^) - L(^>0*) ^ as n ^ oo (since 
||0||o is bounded). Inequality (4.10) implies that -L($0„) - L{^6*) in 
probability. To achieve the best rate of convergence, equal sign should be 
taken in (4.6). 

(2) Note that 0** minimizes L($0) - L(Qo) + 3||0||oA2//3. Below we de- 
monstrate that the minimum of L{^6) — L{Qq) + 3||0||oA^//3 can be viewed 
as the approximation error plus a "tight" upper bound of the estimation 
error of an "oracle" in the stepwise model selection framework [when "=" 
is taken in (4.6)]. Here "tight" means the convergence rate in the bound is 
the best known rate, and "oracle" is defined as follows. 

Let m denote a nonempty subset of the index set {1, . . . , J}. Then each m 
represents a model which uses a nonempty subset of ■ ■ ■ ,4>j} as basis 
functions (there are 2'' — 1 such subsets). Define 

0(7^)= argmin En{R-^ef 

{em.-' : 0j=O for all j^m} 

and 

0*,(m) ^ argmin L($6>). 

{e<m-' : 6»j=0 for all jfm} 

In this setting, an ideal model selection criterion will pick model m* such that 
L{^elr*^) = mimH^O^rT^). O^n''^ is referred as an "oracle" in Massart [19]. 
Note that the excess prediction error of each can be written as 

L{<^et^) - L(Qo) = [L($0*'(")) - L(Qo)] + [^(^^1"^) - L($r'('"))], 

where the first term is called the approximation error of model m and the 
second term is the estimation error. It can be shown that [1] for each model 
m and Xm > 0, with probability at least 1 — exp(— Xm), 

'xm + \m\ log(?i/|m|)~ 



L{^0^^^) - L($6'*'('")) < constant 



n 



under appropriate technical conditions, where |m| is the cardinality of the 
index set m. To our knowledge, this is the best rate known so far. Taking 
Xm = logn + \m\ log J and using the union bound argument, we have with 
probability at least 1 — 0(l/n), 

Li^Jt*^)-L{Qo) 

= min([L($r'("^)) - L(Qo)] + L($0(™)) - L(^>6>*'("^))) 

m 

< mm( [L(^>6>*'('")) - L(Qo)] + constant x 

my n 

(4.12) = mm([L{^e) - L{Qo)] + constant x MkMM^ . 

e \ n J 
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On the other hand, take A„ so that condition (4.6) holds with "=". Equa- 
tion (4.10) imphes that, with probabihty at least 1 — 1/n, 

Li^On) - L(Qo) < min ( [1(^6) - L(Qo)] + constant x H^Ho ^°g('^^) 
eeSn \ n 

which is essentially (4.12) with the constraint of G 0„. (The ^^constant" in 
the above inequalities may take different values.) Since 6 = 0** minimizes 
the approximation error plus a tight upper bound for the estimation error 
in the oracle model, within 6 G G„, we refer to 0** as an oracle. 

(3) The result can be used to emphasize that li penalty behaves sim- 
ilarly as the Iq penalty. Note that On minimizes the empirical prediction 
error En{R — ^9)^ plus an li penalty, whereas 0** minimizes the prediction 
error L{^6) plus an Iq penalty. We provide an intuitive connection between 
these two quantities. First, note that En{R — ^6)^ estimates L{^0) and aj 
estimates aj. We use "ss" to denote this relationship. Thus, 



(4.13) En{R-<^ef + ^n^^j\Oj\ 

J 



J J 

I "n,j I ) 

i=l j=l 

where 9n,j is the jth component of In Appendix A. 2, we show that 
for any 6 G 0„, A„ ^^^^^ (Tj|^„j — 6j\ is upper bounded by ||0||oA^//3 up 

to a constant with a high probability. Thus, On minimizes (4.13) and 0** 
roughly minimizes an upper bound of (4.13). 

(4) The constants involved in the theorem can be improved; we focused 
on readability as opposed to providing the best constants. 

5. A practical implementation and an evaluation. In this section, we 
develop a practical implementation of the /i-PLS method, compare this 
method to two commonly used alternatives and lastly illustrate the method 
using the motivating data from the Nefazodone-CBASP trial [13]. 

A realistic implementation of Zi-PLS method should use a data-dependent 
method to select the tuning parameter, A„. Since the primary goal is to max- 
imize the Value, we select A„ to maximize a cross validated Value estimator. 
For any ITR d, it is easy to verify that E[{R - V {d))l A=d{x) / PiAX)] = 0. 
Thus, an unbiased estimator of V{d) is 

En[lA=d{X)Rlp{A\X)]/En[lA=d{X)/p{A\X)] 
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[22] [recall that the randomization distribution p{a\X) is known]. We split 
the data into 10 roughly equal-sized parts; then for each we apply the 
^i-PLS based method on each 9 parts of the data to obtain an ITR, and 
estimate the Value of this ITR using the remaining part; the A„ that max- 
imizes the average of the 10 estimated Values is selected. Since the Value 
of an ITR is noncontinuous in the parameters, this usually results in a set 
of candidate A^'s achieving maximal Value. In the simulations below, the 
resulting A^ is nonunique in around 97% of the data sets. If necessary, as 
a second step we reduce the set of A„'s by including only A„'s leading to 
the ITR's using the least number of variables. In the simulations below, this 
second criterion effectively reduced the number of candidate A„'s in around 
25% of the data sets, however multiple A„'s still remained in around 90% 
of the data sets. This is not surprising since the Value of an ITR only de- 
pends on the relative magnitudes of parameters in the ITR. In the third 
step we select the A„ that minimizes the 10-fold cross validated prediction 
error estimator from the remaining candidate A„'s; that is, minimization of 
the empirical prediction error is used as a final tie breaker. 

5.1. Simulations. A first alternative to /i-PLS is to use ordinary least 
squares (OLS). The estimated ITR is doLS £ argmax^ <&(X, a)0oLS where 
^OLS is the OLS estimator of 6. A second alternative is called "progno- 
sis prediction" [14]. Usually this method employs multiple data sets, each of 
which involves one active treatment. Then the treatment associated with the 
best predicted prognosis is selected. We implement this method by estimat- 
ing E{R\X,A = a) via least squares with li penalization for each treatment 
group (each A) separately. The tuning parameter involved in each treat- 
ment group is selected by minimizing the 10-fold cross-validated prediction 
error estimator. The resulting ITR satisfies dpp{X) G argmaX(jg_4 E{R\X, A = 
a) where the subscript "PP" denotes prognosis prediction. 

For simplicity, we consider binary A. All three methods use the same 
number of data points and the same number of basis functions but use these 
data points/basis functions differently. /i-PLS and OLS use all J basis func- 
tions to conduct estimation with all n data points whereas the prognosis 
prediction method splits the data into the two treatment groups and uses 
J/2 basis functions to conduct estimation with the n/2 data points in each 
of the two treatment groups. To ensure the comparison is fair across the 
three methods, the approximation model for each treatment group is con- 
sistent with the approximation model used in both Zi-PLS and OLS [e.g., if 
Qo is approximated by {1, X, A, XA)d in /i-PLS and OLS, then in prognosis 
prediction we approximate E{R\X,A = a) by {l,X)6pp for each treatment 
group]. We do not penalize the intercept coefficient in either prognosis pre- 
diction or /i-PLS. 
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The three methods are compared using two criteria: (1) Value maximiza- 
tion; and (2) simphcity of the estimated ITRs (measured by the number of 
variables/basis functions used in the rule). 

We illustrate the comparison of the three methods using 4 examples se- 
lected to reflect three scenarios (see Section S.3 of the supplemental article 
[25] for 4 further examples): 

(1) There is no treatment effect [i.e., Qq is constructed so that Tq = 0; 
example (1)]. In this case, all ITRs yield the same Value. Thus, the simplest 
rule is preferred. 

(2) There is a treatment effect and the treatment effect term Tq is cor- 
rectly modeled [example (4) for large n and example (2)]. In this case, min- 
imizing the prediction error will yield the ITR that maximizes the Value. 

(3) There is a treatment effect and the treatment effect term Tq is mis- 
specified [example (4) for small n and example (3)]. In this case, there might 
be a mismatch between prediction error minimization and Value maximiza- 
tion. 

The examples are generated as follows. The treatment A is generated 
uniformly from { — 1,1} independent of X and the response R. The response 
R is normally distributed with mean Qq{X,A). In examples (l)-(3), X~ 
[/[— 1,1]^ and we consider three simple examples for Qq. In example (4), 
X~ f/[0,l] and we use a complex Qq, where Qq{X,1) and Q{X,—1) are 
similar to the blocks function used in Donoho and Johnstone [8]. Further 
details of the simulation design are provided in Appendix A. 3. 

We consider two types of approximation models for Qq. In examples (1)- 
(3), we approximate Qq by {1,X,A,XA)0. In example (4), we approximate 
Qq by Haar wavelets. The number of basis functions may increase as n 
increases (we index J, <I> and 6* by n in this case). Plots for Qq{X,A) and 
the associated best wavelet fits ^^(X, A)0* are provided in Figure 1. 

For each example, we simulate data sets of sizes n = 2^ for k = 5, ... ,10. 
1,000 data sets are generated for each sample size. The Value of each esti- 
mated ITR is evaluated via Monte Carlo using a test set of size 10,000. The 
Value of the optimal ITR is also evaluated using the test set. 

Simulation results are presented in Figure 2. When the approximation 
model is of high quality, all methods produce ITRs with similar Value [see 
examples (1), (2) and example (4) for large n]. However, when the approx- 
imation model is poor, the /i-PLS method may produce highest Value [see 
example (3)]. Note that in example (3) settings in which the sample size is 
small, the Value of the ITR produced by /i-PLS method has larger median 
absolute deviation (MAD) than the other two methods. One possible reason 
is that due to the mismatch between maximizing the Value and minimizing 
the prediction error, the Value estimator plays a strong role in selecting A„. 
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Fig. 1. Plots for: the conditional mean function Qq(X,A) fleftj, Qq(X,A) and the asso- 
ciated best wavelet fit when J„ — 8 ( middle ), and Qo{X, A) and the associated best wavelet 
fit when J„ ~ 128 f right j [example (4)]- 

The nonsmoothness of the Value estimator combined with the mismatch re- 
sults in very different A„'s and thus the estimated decision rules vary greatly 
from data set to data set in this example. Nonetheless, the Zi-PLS method 
is still preferred after taking the variation into account; indeed /i-PLS pro- 
duces ITRs with higher Value than both OLS and PP in around 46%, 55% 
and 67% in data sets of sizes n = 32, 64 and 128, respectively. Furthermore, 
in general the /i-PLS method uses much fewer variables for treatment as- 
signment than the other two methods. This is expected because the OLS 
method does not have variable selection functionality and the PP method 
will use all variables that are predictive of the response R whereas the use 
of the Value in selecting the tuning parameter in /i-PLS discounts variables 
that are only useful in predicting the response (and less useful in selecting 
the best treatment). 

5.2. Nefazodone-CBASP trial example. The Nefazodone-CBASP trial 
was conducted to compare the efficacy of several alternate treatments for 
patients with chronic depression. The study randomized 681 patients with 
nonpsychotic chronic major depressive disorder (MDD) to either Nefazodone, 
cognitive behavioral-analysis system of psychotherapy (CBASP) or the com- 
bination of the two treatments. Various assessments were taken throughout 
the study, among which the score on the 24-item Hamilton Rating Scale 
for Depression (HRSD) was the primary outcome. Low HRSD scores are 
desirable. See Keller et al. [13] for more detail of the study design and the 
primary analysis. 

In the data analysis, we use a subset of the Nefazodone-CBASP data 
consisting of 656 patients for whom the response HRSD score was observed. 
In this trial, pairwise comparisons show that the combination treatment 
resulted in significantly lower HRSD scores than either of the single treat- 
ments. There was no overall difference between the single treatments. 
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Fig. 2. Comparison of the l\ -PLS based method with the OLS method and the PP method 
[examples (l)-(4)]: plots for medians and median absolute deviations (MAD) of the Value 
of the estimated decision rules (top panels) and the number of variables (terms) needed 
for treatment assignment (including the main treatment effect term, bottom panels) over 
1,000 samples versus sample size on the log scale. The black dash-dotted line in each 
plot on the first row denotes the Value of the optimal treatment rule, for each example. 
/n = 32, 64, 128, 256, 512, 1024. The corresponding numbers of basis functions m example 
(4) are Jn = 8,16,32,64,64,128.] 



We use /i-PLS to develop an ITR. In the analysis, the HRSD score is re- 
verse coded so that higher is better. We consider 50 pretreatment variables 
X = (Xi, . . . jX^q). Treatments are coded using contrast coding of dummy 
variables A = (^1,^42), where A\ = 2 if the combination treatment is as- 
signed and —1 otherwise and A2 = 1 if CBASP is assigned, —1 if nefa- 
zodone and otherwise. The vector of basis functions, ^{X,A), is of the 
form (^1, X , Ai, X Ai, A2, X A2) ■ So the number of basis functions is J = 153. 
As a contrast, we also consider the OLS method and the PP method (sepa- 
rate prognosis prediction for each treatment). The vector of basis functions 
used in PP is {1,X) for each of the three treatment groups. Neither the 
intercept term nor the main treatment effect terms in /i-PLS or PP is pe- 
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nalized (see Section S.2 of the supplemental article [25] for the modification 
of the weights aj used in (4.1)). 

The ITR given by the ?i-PLS method recommends the combination treat- 
ment to all (so none of the pretreatment variables enter the rule). On the 
other hand, the PP method produces an ITR that uses 29 variables. If the 
rule produced by PP were used to assign treatment for the 656 patients in 
the trial, it would recommend the combination treatment for 614 patients 
and nefazodone for the other 42 patients. In addition, the OLS method will 
use all the 50 variables. If the ITR produced by OLS were used to assign 
treatment for the 656 patients in the trial, it would recommend the combi- 
nation treatment for 429 patients, nefazodone for 145 patients and CBASP 
for the other 82 patients. 

6. Discussion. Our goal is to construct a high quality ITR that will 
benefit future patients. We considered an Zi-PLS based method and provided 
a finite sample upper bound for V{do) — V{dn), the reduction in Value of 
the estimated ITR. 

The use of an li penalty allows us to consider a large model for the 
conditional mean function Qo yet permits a sparse estimated ITR. In fact, 
many other penalization methods such as SCAD [9] and Zi penalty with 
adaptive weights (adaptive Lasso; [37]) also have this property. We choose 
the nonadaptive h penalty to represent these methods. Interested readers 
may justify other PLS methods using similar proof techniques. 

The high probability finite sample upper bounds [i.e., (4.7) and (4.9)] 
cannot be used to construct a prediction/confidence interval for V{do) — 
V{dn) due to the unknown quantities in the bound. How to develop a tight 
computable upper bound to assess the quality of dn is an open question. 

We used cross validation with Value maximization to select the tuning 
parameter involved in the Zi-PLS method. As compared to the OLS method 
and the PP method, this method may yield higher Value when Tq is misspec- 
ified. However, since only the Value is used to select the tuning parameter, 
this method may produce a complex ITR for which the Value is only slightly 
higher than that of a much simpler ITR. In this case, a simpler rule may 
be preferred due to the interpretability and cost of collecting the variables. 
Investigation of a tuning parameter selection criterion that trades off the 
Value with the number of variables in an ITR is needed. 

This paper studied a one stage decision problem. However, it is evident 
that some diseases require time- varying treatment. For example, individu- 
als with a chronic disease often experience a waxing and waning course of 
illness. In these settings, the goal is to construct a sequence of ITRs that 
tailor the type and dosage of treatment through time according to an indi- 
vidual's changing status. There is an abundance of statistical literature in 
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this area [17, 20, 21, 27, 29, 30, 34, 35]. Extension of the least squares based 
method to the multi-stage decision problem has been presented in Murphy 
[21]. The performance of penalization in this setting is unclear and worth 
investigation. 

APPENDIX 

A.l. Proof of Theorem 3.1. For any ITR d:X ^A, denote ATd{X) 
= maxagyt To (X, a) — TQ{X,d{X)). Using similar arguments to that in Sec- 
tion 2, we have V{do) - V{d) = E{ATd). If V{do) - V{d) = 0, then (3.4) and 
(3.5) automatically hold. Otherwise, ^(AT^)^ > {EATa)'^ > 0. In this case, 
for any e > 0, define the event 



= < max Tn (X, a) 



max 

ag.4\arg max^g^ To{X,a) 



To{X,a)<e}. 

a) J 



Then AT^ < {ATdf/e on the event Of. This together with the fact that 
ATd < {ATdf/e + e/4 implies 

V{do) - V{d) = E{l^cATd) + E{lnATd) 



<-E[l^c{ATdf]+E 



If^j 



[ATdf 



+ 



-E[{ATdf] + >(a) < -E[{ATdf\ + -ei+- 



e 

>l(A.,)^] + 4- , 4 

where the last inequality follows from the margin condition (3.3). Choosing 
e= (4£^(AT^)^/C)^/(^"'""^ to minimize the above upper bound yields 

(A.l) V{dQ) - V{d) < 2"/(2+")ci/(2+")[^(Ard)2](i+")/(2+"). 

Next, for any d and Q such that d{X) G maXag_4(5(X, a), let T[X,A) be 
the associated treatment effect term. Then 



E{ATdf = E 



< 2E 



maxTo{X, a) - maxT{X, a) + T{X, d{X)) - To{X, d{X)) 
\aeA aeA 

( max Tn (X, a) — max T(X, < 
\aeA aeA 



< 4E 



+ {TiX,d{X))-ToiX,d{X))Y 



max(T(X, a) — To(X, a))^ 
- aeA 



where the last inequality follows from the fact that neither [max^ Tq{X, a) 
maXar(A', a)| nor \T{X,d{X)) — TQ{X,d{X))\ is larger than maXa|T(X, a) 
ro(X, a)|. Since p{a\x) > for all {x,a) pairs, we have 



E{ATdf<4SE[Y^{T{X,a) 
aeA 



ToiX,a)yp{a\X) 



(A.2) 



■ iSE{T{X,A)-ToiX,A)y. 



INDIVIDUALIZED TREATMENT RULES 



23 



Inequality (3.5) follows by substituting (A. 2) into (A.l). Inequality (3.4) 
can be proved similarly by noticing that ATrf(X) = max^g^ Qo (-'^j o) — 
Qo{X,d{X)). 

A.2. Generalization of Theorem 4.3. In this section, we present a gener- 
alization of Theorem 4.3 where J may depend on n and the sparsity of any 
6 G M"^ is measured by the number of "large" components in 6 as described 
in Zhang and Huang [36]. In this case, J, $ and the prediction error mini- 
mizer 0* are denoted as Jn,^n and 0*, respectively. All relevant quantities 
and assumptions are restated below. 

Let \M\ denote the cardinality of any index set MC{1,...,J„}. For any 
€ M"^" and constant p>0, define 

MpA„(6')G argmin |M|. 

{MC{l,...„/„}:E,g{i,...,j„}\M^.I^'jl<P|A^l^n} 

Then Mpx^{6) is the smallest index set that contains only "large" compo- 
nents in 0. \Mp\^{6)\ measures the sparsity of 0. It is easy to see that when 
p = 0, M{){6) is the index set of nonzero components in and |Mo(0)| = 
\\0\\q. Moreover, Mpx„{6) is an empty set if and only ii 6 = 0. 

Let [0* ] be the set of most sparse prediction error minimizers in the linear 
model, that is, 

(A.3) [61]= argmin \MpX,Ae)\. 

0Gargming L(^„d) 

Note that [0*] depends on pXn- 

To derive the finite sample upper bound for L{^n^n), we need the follow- 
ing assumptions. 

Assumption A.l. The error terms £i,i = l,...,n are independent of 
{Xi,Ai),i = 1, . . . ,n and are i.i.d. with E{£i) = and -^[lejl'] < Ic^'^a^ for 
some c, cr^ > for all / > 2. 

Assumption A. 2. For all n > 1: 

(a) there exists an 1 < [/„ < oo such that maxj=i^...^j^ ll'/'j lloo/c^j < Un, 
where aj = (E0|) 1/2. 

(b) there exists an < rji^n < oo, such that sup^gj^*] ||Qo — 'I'n^lloo < ?/i,n- 

For any < 7 < 1/2, 'q2,n > (which may depend on n) and tuning pa- 
rameter A„, define 

e° = |0GM-^":3rG s.t. ||$n(0 - < r?2,n 



and max 

j = l,...,Jn 



E 
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Assumption A. 3. For any n > 1, there exists a /3„ > such that 

< 27+5 



E[<^n{0-e)]^\MpxA0)\>l3n 



for all G e° \ {0}, G M-^" satisfying Eje{i, 



+ p|MpAj0)|A„). 



When Ei^'nHx^Ay^X) = a.s. (^>^''^ is defined in Section 4.1), we need 
an extra assumption to derive the finite sample upper bound for the mean 
square error of the treatment effect estimator E[^n^6n^ — To{X,A)]'^ (recall 
that To iX,A)^ Qo {X, A) - E [Qo {X,A)\X]). 

Assumption A. 4. For any n > 1, there exists a /3„ > such that 

S[$(2)(0-(2)_^(2))]2|^(2)(^)| 



(2) 



>/3„ 



< 



27+5 



for all 6> G e° \ {0}, 6 G R-^" satisfying Eje{i,...,j„}\A/,,„(0) '^jl-il - T=¥i 
{T.jeMpxje) ~ %l + Pl^pA„(^)|A„,), where 



arg mm 



{MC{j(^)+l,...,J„}:E. , ri)^^ <x,|ej|<p|M|A„} 

]<i{Jn +1,..., Jn}\A-f 

is the smallest index set that contains only large components in 0^'^\ 

Without loss of generality, we assume that Assumptions A.S and A. 4 hold 
with the same value of And we can always choose a small enough /3„ so 
that pPn < 1 for a given p. 

For any given t > 0, define 



|M| 



e„ = ^GGMM,Aj0) 



(A.4) 



< 



(l-27)^/3n 
120 



n 



9 2C/2[log(3J„(J„ + l))+t] 3 



Note that we allow Un,r]i^n, 112,71 and /3^^ to increase as n increases. How- 
ever, if those quantities are small, the upper bound in (A. 7) will be tighter. 
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Theorem A.l. Suppose Assumptions A.l and A. 2 hold. For any given 
< 7 < 1/2, r/2,n > 0, p > and t > 0, let On be the li-PLS estimator defined 
in (4-1) with tuning parameter 

8max{3c,2(7yi,„ + ?72,n)}^n(log6Jn + t) 
- (1 - 27)n 

(A.5) 



12max{o-, {r]i,n + V2,n)} / 2(log6Jn + t) 
(1-27) V n 

Suppose Assumption A. 3 holds with pPn < 1- Let 0„ be the set defined in 
(A. 4) and assume 0„ is nonempty. If 

(A.6) 



n - 27C/2 - IO7 - 22 ' 
then with probability at least 1 — exp(— fc^n) — exp(— t), we have 



(A.7) L{^nOn) < mill 



Pn 



where k'^ = 13(1 - 2-if/[%{27Ul - IO7 - 22)] and Kn = [407(12/3„/9 + 27 + 
5)]/[(l - 27)(27 + 19)] + 130(12/3„p + 27 + 5)V[9(27 + 19)^] . 

Furthermore, suppose E{^'h\x,A)^\X) = a.s. If Assumption A. 4 holds 
with the same (3n o,s that in Assumption A.S, then with probability at least 
1 — exp(— fc^n) — exp(— i), we have 

I A/f (f))\ 

Ei^^X^ - Tof < min - T.f + K'J-^^Xl 

where K'n = 20(12/3„p + 27 + 5){7/[(l - 27)(7 - 6/3„p)] + [3(1 - 27)/3„p + 
10(27 + 5)]/[9(27 + 19)2]}. 

Remarks. 

(1) Note that Kn is upper bounded by a constant under the assumption 
f^nP ^ 1- In the asymptotic setting when n — )• 00 and J„ — )• 00, (A.7) im- 
phes that Li^nOn) - minggigj^ L($„6>) i f (i) |MpAj6>° )|A^//3„ = o(l), 
(ii) [/^ log J„/n < ki and \Mpx^{e°)\ < /c2/3n vW(f^PogX) for some suffi- 
ciently small positive constants ki and /c2 and (iii) A„ > ^3 max{l, r/i^„ + 
r/2,n}-\/log Jn/n for a sufficiently large constant k^, where 6" G [0*] (take 
t = log Jn). 

(2) Below we briefly discuss Assumptions A.2-A.4. 

Assumption A. 2 is very similar to assumption (2) in Theorem 4.1 (which is 
used to prove the concentration of the sample mean around the true mean), 
except that f/„ and r]i^n may increase as n increases. This relaxation allows 
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the use of basis functions for which the sup norm maxj ||(/'j||oo is increasing 
in n [e.g., the wavelet basis used in example (4) of the simulation studies]. 

Assumption A. 3 is a generalization of condition (4.8) [which has been 
discussed in remark (4) following Theorem 4.1] to the case where Jn may 
increase in n and the sparsity of a parameter is measured by the number 
of "large" components as described at the beginning of this section. This 
condition is used to avoid the collinearity problem. It is easy to see that 
when /9 = and /3„ is fixed in n, this assumption simplifies to condition (4.8). 

Assumption A. 4 puts a strengthened constraint on the linear model of the 
treatment effect part, as compared to Assumption A. 3. This assumption, to- 
gether with Assumption A. 3, is needed in deriving the upper bound for the 
mean square error of the treatment effect estimator. It is easy to verify that 
if £^[$^$„] is positive definite, then both Assumptions A. 3 and A. 4 hold. 
Although the result is about the treatment effect part, which is asymp- 

(2) 

totically independent of the main effect of X (when E[^):' {X,A)\X] = 
a.s.), we still need Assumption A. 3 to show that the cross product term 
En[{^\l^e^n^ - $i^^6'(i))($i^^0i^^ - $i^^6>(2))] is upper bounded by a quan- 
tity converging to at the desired rate. We may use a really poor model for 
the main effect part E{Qq{X,A)\X) (e.g., = 1), and Assumption A. 4 
implies Assumption A. 3 when p = 0. This poor model only effects the con- 
stants involved in the result. When the sample size is large (so that A„ 
is small), the estimated ITR will be of high quality as long as Tq is well 
approximated. 



Proof of Theorem A.l. For any G define the events 



Jn f 
7 = 1 



^ii±^a, < a, < ^^a, \ [where a, 4 {E^c^]fl\ 



max 

j,k=l,...,J„ 



iE-E„ 



(Pj(pk 



max 

j = l,...,Jn 



En 



< 



< 



(l-27)^/?n 

12O|M,A„(0)| 
47 + 1 



6 



-A, 



Then there exists a 0° € \6*] such that 



L($„0„) = Li^nd) + 2E[{'^nd° - $„0)$„(0 - On)] + E[^n{On " 0)f 

Jn \ 



< L{^nO) + 2 max 

j = l,...,Jr. 

+ E[^n{en-e)f 



E 
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where the first equahty follows from the fact that E[{R — = for 

any 6° G [0* ] for j = 1 , . . . , J„ and the last inequality follows from the defi- 
nition of G°. 

Based on Lemma A. 1 below, we have that on the event VLir\Vt2{0)r\Vt^{6), 

Pn 

Similarly, when =0, by Lemma A.2, we have that on 

the event n ^2(0) n ^3(0), 



IM^^^ (0)\ 

Pn 

The conclusion of the theorem follows from the union probability bounds 
^ the events f^i, r22(^) and ^^.^{O) provided in Lemmas A. 3, A. 4 and A. 5. 

Below we state the lemmas used in the proof of Theorem A.l. The proofs 
of the lemmas are given in Section S.4 of the supplemental article [25]. 

Lemma A.l. Suppose Assumption A. 3 holds with pf3n < 1- Then for any 
9 € Qn, on the event Vti n ^2{9) H Q.^{9), we have 

(AR) V^lfl f}\< 20(12p/3n + 27 + 5) 
and 

(A.9) EmOn - 0)f < 130(12p/?. + 27 + 5)^ 2 

V ; I n\ j\ _ 9(i9 + 27)2^„ ' ''^"^ ^' " 

Remark. This lemma implies that 0„, is close to each 9 G G„ on the 
event ^}irii}2i9) 0^3(9). The intuition is as follows. Since 0„ minimizes (4.1), 
the first order conditions imply that maxj \En{R— ^n9n)4'j/(^j \ ^ A.„/2. Sim- 
ilar property holds for 9 on the event fli D 03(0). Assumption A. 3 together 
with event 1^2(0) ensures that there is no collinearity in the n x design 
matrix (<I>.„(Xj, ^j))f^]^. These two aspects guarantee the closeness of 9n 
to 9. 
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Lemma A. 2. Suppose E[^n\x,A)'^\X] = a.s. and Assumptions A. 3 
and A. 4 hold with pf3n < 1. Then for any 9 G 0„, on the event Qi n ft2{0) n 
^3(0), we have 

(A.IO) y a,|4,-0,|<,l^ii%±^l±^|M(J(0)|A. 



and 
(A.ll) 



2 



. 20(12p/3„ + 27 + 5) [3(1 - 27)/3„p + 10(27 + 5)] , ^^(2) , , 2 
9(27 + 19)2/3„ l^pA. WIAn- 



Lemma A.S. Suppose Assumption A.2{a.) and inequality (A. 6) hold. 
Then P(l^f ) < exp(-A:», where k'^ = 13(1 - 2-ff/[6{27U^ - IO7 - 22)]. 

Lemma A. 4. Suppose Assumption A.2{a) holds. Then for any t> and 

e^en, p({O2(0)}^)<exp(-t)/3. 

Lemma A. 5. Suppose Assumptions A.l and A. 2 hold. For any t > 0, if 
Xn satisfies condition (A. 5), then for any 6 G B„, we have P{{Q3{6)}^) < 
2exp(-t)/3. 

A.S. Design of simulations in Section 5.1. In this section, we present 
the detailed simulation design of the examples used in Section 5.1. These 
examples satisfy all assumptions listed in the theorems [it is easy to verify 
that for examples (l)-(3). Validity of the assumptions for example (4) is 
addressed in the remark after example (4)]. In addition, 0„, defined in (4.4) 
is nonempty as long as n is sufficiently large (note that the constants in- 
volved in Qn can be improved and are not that meaningful. We focused on 
a presentable result instead of finding the best constants). 

In examples (l)-(3), X = (Xi, . . . , X5) is uniformly distributed on [—1, 1]^. 
The treatment A is then generated independently of X uniformly from 
{—1,1}. Given X and A, the response R is generated from a normal distribu- 
tion with mean Qq(X, yl) = 1 + 2Xi + X2 + O.5X3 + Tq{X, A) and variance 1. 
We consider the following three examples for Tq: 

(1) Tq{X,A) = (i.e., there is no treatment effect). 

(2) To{X,A) = 0.424(1 - Xi - X2)A. 

(3) To{X,A) = 0.446 sign(Xi)(l - X^fA. 

Note that in each example Tq(X,A) is equal to the treatment effect term, 
Qo{X,A) - E[Qo{X,A)\X]. We approximate Qo by Q = {{1, X, A, XA)e : 
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6 G M"'^'^}. Thus, in examples (1) and (2) the treatment effect term Tq is 
correctly modeled, while in example (3) the treatment effect term Tq is 
misspecified. 

The parameters in examples (2) and (3) are chosen to reflect a medium 
effect size according to Cohen's d index. When there are two treatments, 
the Cohen's d effect size index is defined as the standardized difference in 
mean responses between two treatment groups, that is, 

_ E{R\A = 1)-E{R\A = -1) 

~ ([Var(ii| A = 1) + Var(i?|^ = -1)]/2)V2 " 

Cohen [7] tentatively defined the effect size as "small" if the Cohen's d index 
is 0.2, "medium" if the index is 0.5 and "large" if the index is 0.8. 

In example (4), X is uniformly distributed on [0,1]. Treatment A is 
generated independently of X uniformly from {—1,1}. The response R is 
generated from a normal distribution with mean Qq{X,A) and variance 1, 
where Qo{X,l) = E|=i ^(i),jlx<«(i)^, , Qo{X,-l) = T!j=i'^{-i),j'^x<u^_^^^, 
and i?'s and u's are parameters specified in (A. 12). The effect size is small: 

(l?(l),l,---,1?(l),8) 

= (-0.781, 0.730, 0.635, 0.512, -2.278, 1.347, 1.155, -0.030); 

('(?(_l),l,...,??(_l),8) 

= (-2.068,1.520,-0.072, 
(A.12) -0.637, 1.003, -0.611, -0.305, 1.016); 

('«(l),l,---,'«(l),8) 

= (0.028, 0.144, 0.171, 0.298, 0.421, 0.443, 0.463, 0.758); 

(tX(_l),l,...,U(„i)^8) 

= (0.061, 0.215, 0.492, 0.544, 0.6302, 0.650, 0.785, 0.909). 
We approximate Qq by Haar wavelets, 

e^o)fiho{X) + 9^o)^MX) + Uo),iho{X) + ^iHikhikiX)] A, 

Ik ^ ife ^ 

where /io(x) = l^g[o,i] and hikix) = 2^^^{l2ixe[k+i/2,k+i) - h'x&[k,k+i/2)) for 
I = 0, . . . ,ln, and e M are parameters. We choose In = [31og2n/4j — 
2. For a given I and sample {Xi, Ai, Ri)"^^^, k takes integer values from 
[2' min, X,\ to max^ Xi]-1. Then J„ = 2L3i°g2"/4j < j^3/4_ 

Remark. In example (4), we allow the number of basis functions J„ to 
increase with n. The corresponding theoretical result can be obtained by 
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combining Theorems 3.1 and A.l. Below we demonstrate the validation of 
the assumptions used in the theorems. 

Theorem 3.1 requires that the randomization probability p{a\x) > 
for a positive constant for all (x,a) pairs and the margin condition (3.3) or 
(3.6) holds. According the generative model, we have that p{a\x) = 1/2 and 
condition (3.6) holds. 

Theorem A.l requires Assumptions A.1-A.4 hold and 0„ defined in (A. 4) 
is nonempty. Since we consider normal error terms, Assumption A.l holds. 
Note that the basis functions used in Haar wavelet are orthogonal. It is also 
easy to verify that Assumptions A. 3 and A. 4 hold with /3„ = 1 and Assump- 
tion A. 2 holds with Un = and ryi.„ < constant + constant x ||0^||o 
[since each \(l)j0^j\ = \(l)jE{(j)jR)\ < constant x < 0(1)]. Since Qq 
is piece-wise constant, we can also verify that ||0^||o < O(logn). Thus, for 
sufficiently large n, 0„ is nonempty and (A. 6) holds. The RHS of (A. 5) 
converges to zero as ?i — )■ oo. 
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SUPPLEMENTARY MATERIAL 

Supplement to "Performance guarantees for individualized treatment 
rules" (DOI: 10.1214/10-AOS864SUPP; .pdf). This supplement contains 
four sections. Section S.l discusses the problem with over-fitting due to the 
potentially large number of pretreatment variables (and/or complex approx- 
imation space for Qq) mentioned in Section 4. Section S.2 provides modifi- 
cations of the /i-PLS estimator 0„, when some coefficients are not penalized 
and discusses how to obtain results similar to inequality (A. 7) in this case. 
Section S.3 provides extra four simulation examples based on data from the 
Nefazodone-CBASP trial [13]. Section S.4 provides proofs of Lemmas A.l- 
A.5. 
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